Abstract— Nowadays, liquid rocket engines use closed-loop control
at most near steady operating conditions. The
control of the transient phases is traditionally performed 
in open-loop due to highly nonlinear system dynamics. 
This situation is unsatisfactory, in particular for reusable 
engines. The open-loop control system cannot provide op- 
timal engine performance due to external disturbances or 
the degeneration of engine components over time. In this 
paper, we study a deep reinforcement learning approach 
for optimal control of a generic gas-generator engine’s con- 
tinuous start-up phase. It is shown that the learned policy 
can reach different steady-state operating points and con- 
vincingly adapt to changing system parameters. A quantita- 
tive comparison with carefully tuned open-loop sequences 
and PID controllers is included. The deep reinforcement 
learning controller achieves the highest performance and 
requires only minimal computational effort to calculate the 
control action, which is a big advantage over approaches 
that require online optimization, such as model predictive 
control. 


Index Terms— Liquid rocket engines, intelligent control, 
reinforcement learning, simulation-based optimization 


I. INTRODUCTION 


The demands on the control system of liquid rocket
engines have significantly increased in recent years [1], in 
particular for reusable engines. Advanced mission scenarios, 
e.g. in-orbit maneuvers or propulsive landings, require deep 
throttling and re-start capabilities. The aging of reusable en- 
gines also requires a robust control system as the performance 
of engine components might degrade over time, e.g. due to 
soot depositions [2]-[4], increased leakage mass flows caused 
by seal aging [5], or turbine blade erosions [6]. The cost- 
efficient operation of a reusable launch vehicle is only possible 
if the engines possess a long service life without expensive 
maintenance. 
Nowadays, most liquid rocket engines use predefined valve 
sequences to drive the system from the start signal to a desired 
steady-state and to shut down the engine safely. These control 
sequences are usually determined during costly ground tests.
Closed-loop control is at most used near steady operating 
conditions to maintain a desired combustion chamber pressure 
and mixture ratio [7]. The resulting lower deviations of the 
controlled variables decrease the amount of extra propellant 
to be carried, which in turn increases the payload capacity 
of the launch vehicle. Although the importance of closed-loop 
control has been evident for many years, the majority of rocket 
engines still employ valves which are operated with pneumatic 
actuators, too inefficient for a sophisticated closed-loop control 
system. The development of an all-electric control system 
started in the late 90s in Europe [8]. The future European 
Prometheus engine will have such a system [9]. Other coun- 
tries are also well advanced in the research and development 
of electrically operated flow control valves [10]. Due to the 
electrification of the actuators and the grown demands, the 
interest in closed-loop solutions increased recently and will 
further rise in the future. 

Furthermore, optimal control of the engine operation, in- 
cluding the transient phases, is the only way to realize high 
performing systems, which also comply with the aforemen- 
tioned demands on the control system of future liquid rocket 
engines [11]. One way to solve optimal control problems is 
to use reinforcement learning (RL). Although the application 
of such modern methods of artificial intelligence seems un- 
orthodox in this setting, it offers certain advantages. First, 
given a suitable simulation environment, RL algorithms can 
automatically generate optimal transient sequences. Second, 
the trained RL controller features a minimal computational 
effort to calculate the control action, so it can easily be used for 
closed-loop control of the demanding transient phases. Third, 
RL is perfectly suited for complex control tasks, including 
multiple objectives and multiple regimes [12]. Optimal control 
using RL [13] has been studied in many different areas, 
from robotics [14], [15] and medical science [16] to flight 
control [17], [18] and process control [19]. Furthermore, the 
benefits of an intelligent engine control system, where artificial 
intelligence techniques are used for control reconfiguration and 
condition monitoring, have already been investigated in the
Space Shuttle era [20], [21].

The objective of our work is analogous to the investigation 
of Pérez-Roca et al. [22], where a model predictive control 
(MPC) approach to control the start-up transient of a liquid 
rocket engine was studied. After the derivation of a suitable 
state-space model [23], a linear MPC controller was synthesized.
The controller completes the start-up and can track
the end-state references with sufficient accuracy. MPC and
RL have specific advantages and disadvantages. The work 
presented here aims to evaluate the capabilities and limitations 
of RL for liquid rocket engine control. 
Our main contributions are the following: 
• formulation of optimal start-up control as an RL problem
• training and evaluation of the RL controller for multiple
operating conditions and degrading turbine efficiencies
• quantitative comparison with carefully tuned open-loop
sequences and PID controllers
The remainder of this paper is structured as follows: Section II
describes the basics of RL and presents pseudocode of
the RL algorithm used. The simulation environment and its
coupling with the RL algorithms are outlined in Section III.
Section IV discusses the test case. Section V reports the
results, including the comparison with the performance of PID
controllers. Finally, Section VI provides concluding remarks.


II. REINFORCEMENT LEARNING


In this section, we review basic RL concepts [24]. RL 
algorithms can be used to solve optimal control problems 
stated as Markov decision processes (MDPs) [25]. MDPs 
provide a mathematical framework for modeling decision 
making in situations where the system changes possibly in 
a stochastic manner. Standard MDPs work in discrete time: 
at each time step, the controller (usually called the agent in 
RL) receives information on the state of the system and takes 
an action in response. The decision rule is called a policy 
in RL. The action changes the state of the system, and the 
latest transition is evaluated via a reward function. The optimal 
control objective is to maximize the (expected) cumulative 
reward from each initial state. Formally, an MDP consists of 
the state-space X of the system, the action (input) space U, 
the transition function (dynamics) f of the system, and the 
reward function $\rho$ (negative costs). Due to the origins of the
field in artificial intelligence, the usual notation would be S 
for the state-space, A for the action space, P for the dynamics, 
and R for the reward function. In this paper, notation inspired 
by control theory is used. As a result of the action $u_k$ applied
in state $x_k$ at discrete time step $k$, the state changes to $x_{k+1}$
and a scalar reward $r_{k+1} = \rho(x_k, u_k, x_{k+1})$ is received. The
goal is to find a policy $\pi$, so that $u_k = \pi(x_k)$, that maximizes
the cumulative reward, typically the expected discounted sum 
over the infinite horizon: 


$$\mathbb{E}_{x_{k+1} \sim f(x_k, \pi(x_k), \cdot)} \left\{ \sum_{k=0}^{\infty} \gamma^k \rho(x_k, \pi(x_k), x_{k+1}) \right\} \qquad (1)$$


where $\gamma \in (0, 1]$ is the discount factor. The mapping from a
state $x_0$ to the value of the cumulative reward for a policy $\pi$
is called the (state) value function $V^\pi(x_0)$:

$$V^\pi(x_0) = \mathbb{E}_{x_{k+1} \sim f(x_k, \pi(x_k), \cdot)} \left\{ \sum_{k=0}^{\infty} \gamma^k \rho(x_k, \pi(x_k), x_{k+1}) \right\} \qquad (2)$$
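As a minimal sketch of the quantity in (1) and (2), the following Python snippet evaluates the discounted sum for a finite list of rewards and averages it over sampled episodes. Truncating the infinite horizon to a finite episode and the names `discounted_return`, `monte_carlo_value`, and `sample_episode` are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(sample_episode, gamma, n_episodes=100):
    """Estimate the value of one start state by averaging sampled returns.

    sample_episode is a hypothetical callable that runs the policy once
    from the start state and returns the list of rewards it collected.
    """
    returns = [discounted_return(sample_episode(), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```

For a deterministic system and policy, a single episode already gives the value; the averaging only matters when the transitions are stochastic.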
The control objective is to find an optimal policy $\pi^*$ that leads
to the maximal value function, for all $x_0$:

$$V^*(x_0) := \max_{\pi} V^\pi(x_0), \quad \forall x_0 \qquad (3)$$
Although state-value functions suffice to define optimality, it
is useful to define action-value functions, called Q-functions.
The action-value function gives the expected reward if one
starts in state $x$, takes an arbitrary action $u$ (which may
not have come from the policy), and then forever after acts
according to policy $\pi$:

$$Q^\pi(x, u) = \mathbb{E}_{x' \sim f(x, u, \cdot)} \left\{ \rho(x, u, x') + \gamma V^\pi(x') \right\}, \qquad (4)$$


where the prime notation indicates quantities at the next
discrete time step. The optimal Q-function $Q^*$ is defined using
$V^*$. Once an optimal Q-function $Q^*$ is available, an optimal
policy $\pi^*$ can be computed by

$$\pi^*(x) \in \arg\max_{u} Q^*(x, u), \qquad (5)$$


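while the formula to compute $\pi^*$ from $V^*$ is more complicated.
As a consequence of the definitions, the Q-functions
$Q^\pi$ and $Q^*$ fulfill the Bellman equations:

$$Q^\pi(x, u) = \mathbb{E}_{x' \sim f(x, u, \cdot)} \left\{ \rho(x, u, x') + \gamma Q^\pi(x', \pi(x')) \right\} \qquad (6)$$
and
$$Q^*(x, u) = \mathbb{E}_{x' \sim f(x, u, \cdot)} \left\{ \rho(x, u, x') + \gamma \max_{u'} Q^*(x', u') \right\}, \qquad (7)$$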
which are of central importance in RL. The crucial advantage 
of RL algorithms is that they do not require a model of the 
system dynamics. Instead, an optimal policy can be found by 
learning from samples of transitions and rewards. The problem 
formulation with MDPs and the associated solution techniques 
also handle nonlinear, stochastic dynamics, and nonquadratic 
reward functions. Perhaps the most popular RL algorithm is 
Q-learning. In Q-learning, one starts from an arbitrary initial 
Q-function Qo and updates it using observed state transitions 
and rewards. The update rule is of the following form: 


$$Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q_k(x_{k+1}, u') - Q_k(x_k, u_k) \right], \qquad (8)$$


where $\alpha_k \in (0, 1]$ is the learning rate. The term inside the
square bracket is nothing else than the difference between
the updated estimate of the optimal Q-value of $(x_k, u_k)$
and the current estimate $Q_k(x_k, u_k)$. Under mild assumptions
on the learning rate and that a suitable exploratory
policy is used to obtain samples, i.e. data tuples of the form
$(x_k, u_k, x_{k+1}, r_{k+1})$, Q-learning asymptotically converges to
$Q^*$, which satisfies the Bellman optimality equation. The
reader is referred to [26] for a description of similar RL 
algorithms. Q-learning and its many variants require that Q- 
functions and policies are exactly represented, e.g. as a table 
indexed by the discrete states and actions. Especially for 
the control of physical systems, the states and actions are 
continuous; moreover, exact representations are in general 
impossible. Normal Q-learning does not work in this setting. 
Fortunately, methods like Q-learning can be combined with 
function approximation. We denote approximate versions of
the Q-function and the policy by $\hat{Q}(x, u; \theta)$ and $\hat{\pi}(x; w)$,
where $\theta$ and $w$ are the parameters of parametric approximators.
There are many different function approximators to choose 
from. 
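To make the tabular case concrete before function approximation enters, the Q-learning update (8) can be sketched as below. This is a minimal illustration for discrete states and actions, not the algorithm used in this paper; the epsilon-greedy exploration rule and all names are assumptions.

```python
import random
from collections import defaultdict


def q_learning_update(Q, x, u, r, x_next, actions, alpha, gamma):
    """Apply one update of the form (8) to the table Q in place."""
    best_next = max(Q[(x_next, u_next)] for u_next in actions)
    td_error = r + gamma * best_next - Q[(x, u)]  # term in square brackets
    Q[(x, u)] += alpha * td_error
    return td_error


def epsilon_greedy(Q, x, actions, eps, rng=random):
    """Exploratory policy: random action with probability eps, else greedy."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])


Q = defaultdict(float)  # arbitrary initial Q-function, here Q_0 = 0
actions = [0, 1]
# One observed transition (x_k, u_k, x_{k+1}, r_{k+1}) with made-up values.
q_learning_update(Q, x="s0", u=1, r=1.0, x_next="s1",
                  actions=actions, alpha=0.5, gamma=0.9)
```

The table is indexed by (state, action) pairs, which is exactly the exact representation that becomes impossible for continuous states and actions.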


WAXENEGGER-WILFING et al.: A REINFORCEMENT LEARNING APPROACH FOR TRANSIENT CONTROL OF LIQUID ROCKET ENGINES 3 


The combination of RL with deep neural networks (DNNs) 
as function approximators leads to the field of deep RL. In 
recent years, deep RL algorithms have achieved impressive
results, such as reaching super-human performance in the 
game of Go. Besides the sensational results in board games 
or video games, those algorithms are successfully used in 
areas like robotics. In deep Q-learning, one uses a neural 
network to approximate the Q-function. Neural networks can 
represent any smooth function arbitrarily well given enough 
parameters, and therefore they can learn complex Q-functions. 
Loss functions and gradient descent optimization are used to fit 
the parameters of the models. Gradient estimates are usually 
averaged over individual gradients computed for a batch of 
experiences. 

Nevertheless, the simple training procedure is unstable, 
because sequential observations are correlated, and techniques 
like experience replay have to be used. Correlated experiences 
are saved into a replay buffer. When batches of experiences are 
needed for training, these batches are generated by sampling 
from the replay buffer in a randomized order. A further 
reason for the simple training procedure’s instability is that 
the target values depend on the parameters one wants to 
optimize. The solution is to use a so-called target network, 
$\hat{Q}(x, u; \theta^-)$, with target parameters $\theta^-$, which slowly track
the online parameters. While deep Q-learning solves problems 
with continuous state-spaces, it can only handle discrete and 
low-dimensional action spaces. The reason for that is the 
following: (deep) Q-learning requires fast maximization of 
Q-functions over actions. When there are a finite number of 
discrete actions, this poses no problem. However, when the 
action space is continuous, this is highly non-trivial (and would
be a computationally very expensive subroutine).
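The two stabilization techniques described above, experience replay and slowly tracking target parameters, can be sketched in a few lines. The snippet is a simplified illustration with parameters stored as plain lists of floats; the class and function names, and the soft-update form with rate `tau`, are assumptions rather than the implementation used here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions (x, u, r, x_next, done)."""

    def __init__(self, capacity):
        # With maxlen set, the oldest transitions are dropped first.
        self._data = deque(maxlen=capacity)

    def add(self, x, u, r, x_next, done):
        self._data.append((x, u, r, x_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards its online value."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]
```

A small `tau` keeps the targets nearly fixed between updates, which is what decouples them from the parameters being optimized.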

The deep deterministic policy gradient (DDPG) [27] algo- 
rithm is specially adapted for environments with continuous 
action spaces. It uses neural networks to approximate both 
the Q-function and a deterministic policy, i.e. the policy 
network deterministically maps a state to a specific action. For 
exploration, one adds noise sampled from a stochastic process 
N to the actions of the deterministic policy and updates it by a 
gradient-based learning rule. As in deep Q-learning, the DDPG 
algorithm uses a replay buffer and target networks to improve 
stability during neural network training. Further details of the 
DDPG algorithm and its performance on different simulated 
physics tasks are given by Lillicrap et al. [27]. 

Although the DDPG algorithm is quite powerful, it has 
a direct successor, the Twin Delayed DDPG (TD3) [28] 
algorithm, which further improves the stability by employing 
three critical tricks. The first trick addresses a particular failure 
mode of the DDPG algorithm: if the Q-function approximator 
develops an incorrect sharp peak for some actions, the policy 
will quickly exploit that peak and then have brittle or incorrect 
behavior. This failure mode can be averted by smoothing out 
the Q-function over similar actions. For this, one computes 
the action that is used to form the Q-learning target in the 
following way: 


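$$u'(x') = \mathrm{clip}\left( \hat{\pi}(x'; w^-) + \mathrm{clip}(\epsilon, -c, c),\; u_{\mathrm{Low}},\; u_{\mathrm{High}} \right), \qquad (9)$$


where $\epsilon \sim \mathcal{N}(0, \sigma)$ is noise sampled from a Gaussian process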


Algorithm 1 Twin Delayed DDPG (TD3)
 1: Input: initial policy parameters $w$, Q-function parameters $\theta_1$, $\theta_2$, empty replay buffer $\mathcal{D}$
 2: Set target parameters equal to main parameters: $w^- \leftarrow w$, $\theta_1^- \leftarrow \theta_1$, $\theta_2^- \leftarrow \theta_2$
 3: repeat
 4:   Observe state $x$ and select action $u = \mathrm{clip}(\hat{\pi}(x; w) + \epsilon, u_{\mathrm{Low}}, u_{\mathrm{High}})$, where $\epsilon \sim \mathcal{N}$
 5:   Execute $u$ in the environment
 6:   Observe next state $x'$, reward $r$, and done signal $d$ to indicate whether $x'$ is terminal
 7:   Store $(x, u, r, x', d)$ in replay buffer $\mathcal{D}$
 8:   If $x'$ is terminal, reset environment state
 9:   if it is time to update then
10:     for $j$ in range(however many updates) do
11:       Randomly sample a batch of transitions $B = \{(x, u, r, x', d)\}$ from $\mathcal{D}$
12:       Compute target actions
            $u'(x') = \mathrm{clip}(\hat{\pi}(x'; w^-) + \mathrm{clip}(\epsilon, -c, c), u_{\mathrm{Low}}, u_{\mathrm{High}})$, $\epsilon \sim \mathcal{N}(0, \sigma)$
13:       Compute targets
            $q(r, x', d) = r + \gamma (1 - d) \min_{i=1,2} \hat{Q}(x', u'(x'); \theta_i^-)$
14:       Update Q-functions by one step of gradient descent using
            $\nabla_{\theta_i} \frac{1}{|B|} \sum_{(x,u,r,x',d) \in B} \left( \hat{Q}(x, u; \theta_i) - q(r, x', d) \right)^2$, for $i = 1, 2$
15:       if $j$ mod policy delay $= 0$ then
16:         Update policy by one step of gradient ascent using
              $\nabla_{w} \frac{1}{|B|} \sum_{x \in B} \hat{Q}(x, \hat{\pi}(x; w); \theta_1)$
17:         Update target networks
              $\theta_i^- \leftarrow (1 - \tau)\theta_i^- + \tau\theta_i$, for $i = 1, 2$
              $w^- \leftarrow (1 - \tau)w^- + \tau w$
18:       end if
19:     end for
20:   end if
21: until convergence


(target policy noise). The action is based on the target policy, 
but with clipped noise added (target noise clip c). After adding 
the noise, the target action is also clipped to lie in the valid 
action range $(u_{\mathrm{Low}}, u_{\mathrm{High}})$. The second trick is to learn two
Q-functions $\hat{Q}(x, u; \theta_i)$, for $i = 1, 2$, instead of one and
use the smaller of the two Q-values to form the target. This
improvement reduces overestimation in the Q-function.


$$q(r, x', d) = r + \gamma (1 - d) \min_{i=1,2} \hat{Q}(x', u'(x'); \theta_i^-) \qquad (10)$$


The third trick is to update the policy less frequently than the 
Q-functions (policy delay) to damp the volatility that arises in 
the DDPG algorithm. Algorithm 1 shows the full pseudocode
of the TD3 algorithm. The done signal d is equal to one when 
x’ is the terminal state and otherwise equal to zero. The done 
signal guarantees that the agent gets no additional rewards 
after the current state at the end of an episode. 
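The first two tricks and the role of the done signal can be sketched for a single transition as follows. This is a simplified illustration with a scalar action and plain Python callables standing in for the target networks; the function `td3_target` and its argument names are assumptions, and the third trick (delayed policy updates) only affects the training loop and is not shown.

```python
import random


def clip(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))


def td3_target(r, x_next, done, target_policy, target_q1, target_q2,
               gamma, sigma, c, u_low, u_high, rng=random):
    """Target value for one transition with a scalar action."""
    # Trick 1: target policy smoothing with clipped Gaussian noise,
    # followed by clipping to the valid action range.
    noise = clip(rng.gauss(0.0, sigma), -c, c)
    u_next = clip(target_policy(x_next) + noise, u_low, u_high)
    # Trick 2: use the smaller of the two target Q-values.
    q_min = min(target_q1(x_next, u_next), target_q2(x_next, u_next))
    # done = 1 removes the bootstrap term at the end of an episode.
    return r + gamma * (1.0 - done) * q_min
```

With `done = 1` the target reduces to the immediate reward, which is how the agent is prevented from collecting rewards beyond the end of an episode.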

In addition to enhancements that improve the stability of 
the training process, research is also carried out to speed 
up the learning process of RL agents [29]. Besides DDPG, 
TD3, or SAC [30], which are so-called off-policy algorithms, 
there are also state-of-the-art on-policy algorithms like TRPO 
[31] or PPO [32]. Nevertheless, on-policy methods are considerably
less sample-efficient and need longer training times to
achieve equivalent performance. From a control perspective,
reinforcement learning converts the system identification
problem and the optimal control problem into machine learning
problems. Similar to explicit model predictive control, it
also removes one of the main
drawbacks of model predictive control, namely the need to
solve a complex optimization problem online to compute the
control action.


The main advantages of RL for control: 


• no derivation of a suitable state-space model, model order
reduction or linearization needed

• direct use of a nonlinear simulation model

• ideal for highly dynamic situations (no complex online
optimization needed)

• complex reward functions enable complicated goals


The main disadvantage of RL for control: 


• stability of the controller is in general not guaranteed


Concerning the last point (stability), we would like to make a 
remark. The controller’s output can always be tested using the 
simulation environment, and there has been promising recent 
work on certifying stability of RL policies [33]. 


III. SIMULATION ENVIRONMENT AND RL
IMPLEMENTATION 


A suitable simulation environment for our intended use 
is given by EcosimPro [34]. EcosimPro is a modeling and 
simulation tool for 0D or 1D multidisciplinary continuous and
discrete systems. The system description is based on
differential-algebraic equations and discrete events. Within a graphical
user interface, one can combine different components, which
are arranged in several libraries. Of particular interest are 
the European Space Propulsion System Simulation (ESPSS) 
libraries, which are commissioned by the European Space 
Agency (ESA). These EcosimPro libraries are suited for the 
simulation of liquid rocket engines and have continuously been 
upgraded in recent years. 

We use the TD3 implementation of Stable-Baselines [35]. 
Stable-Baselines is a set of improved implementations of RL 
algorithms based on OpenAI Baselines. It features a common 
interface for many modern RL algorithms and additional 
wrappers for preprocessing, monitoring, and multiprocessing. 
Fig. 1. Flow plan of the considered engine architecture. Some of
the propellants are burned in an additional combustion chamber, the 
gas-generator (GG), and the resulting hot-gas is used as the working 
medium of the turbines which power the engine’s pumps. The gas is 
then exhausted. The engine architecture features five valves, but only 
three valves (VGH, VGO, VGC) are used for closed-loop control. 


We encapsulate our simulation environment into a custom 
OpenAI Gym environment using an interface between Ecosim- 
Pro and Python. Hence, we can directly use Stable-Baselines 
for training and testing. A big advantage of the RL approach 
is that it works regardless of whether one uses a lumped 
parameter model, continuous state-space models, surrogate 
models employing artificial neural networks [36], [37], or a 
combination of the above. 
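The encapsulation can be pictured with the following Gym-style skeleton, which exposes the usual `reset` and `step` methods. It is only a sketch under strong assumptions: the class `ToyStartUpEnv`, its single valve input, and the first-order pressure response are hypothetical stand-ins for the simulator call, not an engine model and not the interface used in this work. A real environment would additionally subclass `gym.Env` and declare its observation and action spaces.

```python
class ToyStartUpEnv:
    """Gym-style environment (reset/step) around a stand-in simulator."""

    def __init__(self, p_ref=100.0, dt=0.04, n_steps=125):
        self.p_ref = p_ref        # reference chamber pressure in bar
        self.dt = dt              # time between two agent actions in s
        self.n_steps = n_steps    # episode length in steps (assumed)
        self.reset()

    def reset(self):
        self.k = 0                # step counter
        self.p = 0.0              # toy chamber pressure in bar
        return self._observation()

    def step(self, action):
        # Stand-in for advancing the simulator by dt: the pressure relaxes
        # towards a value proportional to the clipped valve command.
        valve = min(max(action, 0.0), 1.0)
        p_target = 120.0 * valve
        self.p += self.dt / 0.5 * (p_target - self.p)
        self.k += 1
        reward = -abs(self.p - self.p_ref) / self.p_ref
        done = self.k >= self.n_steps
        return self._observation(), reward, done, {}

    def _observation(self):
        # One-element observation: normalized tracking error.
        return [(self.p - self.p_ref) / self.p_ref]


env = ToyStartUpEnv()
obs = env.reset()
done = False
steps = 0
while not done:
    # Constant valve command in place of a trained policy.
    obs, reward, done, info = env.step(100.0 / 120.0)
    steps += 1
```

Because the agent only sees `reset` and `step`, the model behind them can be exchanged without touching the training code.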


IV. TEST CASE 


The engine architecture considered to study the suitability 
of an RL approach for the control of the transient start-up 
is shown in Fig. 1. It is similar to the architecture of the
European Vulcain 1 engine [38], which powered the cryogenic 
core stage of the Ariane 5 launch vehicle before it got replaced by
Fig. 2. 100 bar nominal open-loop start-up sequence. The main combustion chamber pressure settles at 100 bar, while the gas-generator pressure
reaches 75 bar. The reference mixture ratios are given by 5.2 and 0.9. The engine reaches steady-state conditions after approximately 4s. 


the upgraded Vulcain 2 engine. It is fed with cryogenic liquid 
oxygen (LOX) at a temperature of 92 K and liquid hydrogen 
(LH2) at 22 K. The engine generates approximately 1 MN of 
thrust at a main combustion chamber (CC) pressure of 100 bar 
and a chamber mixture ratio of 5.6, i.e. the chamber mass 
flow of the oxidizer divided by the chamber mass flow of the 
fuel equals 5.6. The engine cycle is an open gas-generator 
cycle, where a small amount of the propellants is burned in a 
small combustion chamber, the gas-generator (GG). The gas- 
generator is operated at a fuel-rich mixture ratio of 0.9. The 
produced hot-gas is used to drive the turbines before it is 
exhausted. The turbines power the pumps which force the 
propellants into the combustion chambers. LH2 is used to 
cool the nozzle and main combustion chamber before it gets 
burned. A convergent-divergent nozzle, which usually includes 
an uncooled nozzle extension (NE), accelerates the combustion 
gases to generate thrust. 

The actuators are given by five flow control valves (VCO, 
VCH, VGO, VGH, VGC). VCO and VCH are the main 
combustion chamber valves that regulate the propellant flow 
to the combustion chamber. VGO and VGH, the gas-generator 
valves, are used to control the gas-generator pressure and 
mixture ratio. The turbine valve, VGC, is located downstream 
of the gas-generator and is used to determine the hot-gas flow 
ratio between the LOX and LH2 turbines. Thus, this valve 
mainly influences the global mixture ratio (PI, pump-inlet). 
Further actuators are the ignition systems (IGN) for the main 
combustion chamber and the gas-generator, as well as a turbine 


starter. The turbine starter produces hot-gas for a short period 
to spin up the turbines during the start-up. 

To start the engine and reach steady-state conditions, a 
succession of discrete events, including valve openings and 
chamber ignitions are necessary. The start-up sequence of 
an engine, i.e. the chronological order of oxidizer and fuel 
valve openings, as well as the precise ignition timings, deter- 
mines the engine’s thermodynamic conditions and mechanical 
stresses during start-up. A non-ideal start-up sequence can 
damage the engine, e.g. by excessive temperatures. These high 
temperatures can substantially damage the turbine blades or 
at least reduce their life expectancy [39]. An optimal start-up
sequence leads to a smooth ignition of the combustion
chamber and gas-generator with low thermal and mechanical
stresses. An open-loop start-up sequence (OLS) for a steady-state
chamber pressure of 100 bar is shown in Fig. 2. The
sequence does not correspond exactly to the Vulcain 1 start-up 
sequence, but it is realistic for such an engine cycle. The flow 
control valves are opened monotonically until the end positions 
are reached. First, the VCH valve starts to open at t = 0.1s,
followed by VCO at t = 0.6s. A fuel-lead transient is usually 
used for a smooth ignition of the combustion chamber, which 
takes place at t = 1.0s. At this point, the main combustion 
chamber is burning at low pressure, only fed by the tank 
pressurization. At t = 1.1s, the turbine starter activates to 
spin up the turbopumps, which start to build up the pressure in 
the main combustion chamber and at the gas-generator valves 
VGO, VGH. At t = 1.4s and t = 1.5s, the gas-generator 


valves VGH and VGO open and the gas-generator is ignited. 
The VGC valve is set to a fixed position during the entire start- 
up sequence. At t = 2.6s, the turbine starter is burned out and 
the engine reaches steady-state conditions after approximately 
4s. The valve positions in Fig. 2 are tuned to reach a main
combustion chamber pressure $p_{CC}$ of 100 bar, a global mixture
ratio $MR_{PI}$ of 5.2 and a gas-generator mixture ratio $MR_{GG}$ of
0.9.

Although RL can solve discrete or hybrid control problems, 
there are controllability and observability issues during the first 
phase of discrete events due to very low mass flows [21]. Thus 
we focus on the fully continuous phase starting at t = 1.5s. 
The goal of the controller (agent) is to drive the engine as 
fast as possible towards the desired reference by adjusting the 
flow control valve positions. In our multi-input multi-output 
(MIMO) control tasks, only three flow control valves, VGO, 
VGH, and VGC, are used for active control of the combustion 
chamber pressure, and the mixture ratio of the gas-generator 
as well as the global mixture ratio. The valve actuators are 
modeled as a first-order transfer function with a time constant 
of $\tau = 0.05$ s and a linear valve characteristic. The minimum
valve position is set to 0.25 for VGH and VGO and 0.20 for 
VGC, respectively. The maximum valve position is 1.0 for all 
valves. 
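The actuator model described above can be sketched as a discrete first-order lag with position limits. The function name `valve_step` and the choice of an exact exponential discretization are assumptions; the defaults correspond to the time constant and the VGH/VGO limits quoted in the text.

```python
import math


def valve_step(position, command, dt, tau=0.05, pos_min=0.25, pos_max=1.0):
    """Advance a first-order valve actuator by one time step dt (in s)."""
    # Keep the commanded position inside the allowed range.
    command = min(max(command, pos_min), pos_max)
    # Exact discretization of a first-order lag for a constant command.
    alpha = 1.0 - math.exp(-dt / tau)
    return position + alpha * (command - position)


# Response to a step command from the minimum position to fully open.
pos = 0.25
for _ in range(5):
    pos = valve_step(pos, 1.0, dt=0.05)
```

After one time constant the valve has covered about 63 % of the commanded step, so commands that change faster than this are noticeably filtered.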

We study different reference values for the combustion 
chamber pressure, namely 80 and 100 bar. The reference mix- 
ture ratios remain the same, 5.2 for the global mixture ratio and 
0.9 for the mixture ratio of the gas-generator. For a combustion 
chamber pressure of 80 bar, the valve timings are the same, 
but the final valve positions were adjusted accordingly (see 
Fig. p}. Furthermore, we study the effect of degrading turbine 
efficiencies on the start-up transient. This scenario has practi- 
cal relevance for future reusable engines. The use of cryogenic 
propellants leads to significant thermostructural challenges in 
the operation of turbopumps. Since thermal stresses depend 
on the temperature gradient, they can cause significant loads 
on the metal parts that have to react to these stresses. The 
resulting fatigue deformation [39] affects the performance 
of the turbines. Furthermore, the aging of seals can cause 
increased leakage mass flows, which in turn decreases the 
turbine efficiency [5]. Additional reasons are turbine blade 
erosions [6] and soot depositions on the turbine nozzles by 
fuel-rich gases when using hydrocarbons as fuel. These soot 
depositions can decrease the effective nozzle area up to 20% 
[2], thus reducing the turbopump performance. Furthermore, 
soot depositions are a main shortcoming for reusable engines 
due to the unpredictable impacts for engine re-start [40]. 
To study the effect of degrading turbine efficiencies for our 
generic test case, we simulate and evaluate the performance 
of the open-loop start-up sequence, a family of PID controllers, 
and our RL-agent for 16 different combinations of LOX 
and LH2 turbine efficiencies. For each turbine, 4 different 
efficiencies are considered, ranging from 100 % to 85 % of the
nominal value. 
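The 16 evaluation cases follow from combining the four efficiency levels of both turbines. The short sketch below enumerates them; the even spacing in steps of 5 % is an assumption, since only the range and the number of levels are stated.

```python
from itertools import product

# Assumed levels: four evenly spaced fractions of the nominal efficiency.
levels = [1.00, 0.95, 0.90, 0.85]

# Each case is a pair (LOX turbine efficiency, LH2 turbine efficiency).
cases = list(product(levels, repeat=2))
```

Each controller is then run once per case, so the comparison covers the nominal engine as well as one-sided and two-sided degradation.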

The reward, which is used to evaluate a start-up sequence
and to train the RL agent, consists of 3 different terms:


$$r = r_{SP} + r_{GG} + r_{valve}. \qquad (11)$$
The first term


$$r_{SP} = -\sum_{x_i} \mathrm{clip}\left( \left| \frac{x_i - x_{i,\mathrm{ref}}}{x_{i,\mathrm{ref}}} \right|, 0.2 \right) \qquad (12)$$


for $x_i \in [p_{CC}, MR_{GG}, MR_{PI}]$ penalizes deviations from the
desired set-point for all controlled variables. Each reward 
component in this term is clipped to a maximum value of 
0.2 to improve training and to balance the accumulated reward 
during start-up and steady-state. The second term of the reward 


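$$r_{GG} = \begin{cases} -\dfrac{MR_{GG} - MR_{GG,\mathrm{ref}}}{MR_{GG,\mathrm{ref}}}, & \text{if } MR_{GG} > MR_{GG,\mathrm{ref}} \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$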


additionally penalizes high mixture ratios in the gas-generator. 
High mixture ratios are dangerous because they result in 
increased temperatures and thus possible damaging conditions 
to the turbines. The last reward component 


$$r_{valve} = -\frac{|s_{VGH}| + |s_{VGO}| + |s_{VGC}|}{3}, \qquad (14)$$


where s is the change in valve position between two time steps, 
penalizes excessive valve motion. By adding this component, 
we encourage the agent to move the valves as little as 
possible to avoid valve wear, valve oscillations, and valve 
jittering. All together, this reward allows the agent to trade 
off between reaching the desired reference point as fast as 
possible, avoiding steady-state errors, minimizing overshoots, 
and reducing valve motion as much as possible. Fig. 2 shows
all 3 components of the cumulative reward for the nominal
OLS for 100 bar. Since the valves are only moved once in the
OLS, the contribution of $r_{valve}$ to the total reward is low. As
the overshoot in the gas-generator mixture ratio is also small
(small $r_{GG}$), the total reward is mainly composed of the
set-point error $r_{SP}$.
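The three reward terms can be combined in code roughly as follows. This is a sketch under assumptions: the function name, the dictionary keys, and the reading of the gas-generator penalty as active only above the reference mixture ratio are illustrative choices, and the clip is interpreted as an upper limit of 0.2 on each relative deviation.

```python
def reward(x, x_ref, valve_steps, clip_max=0.2):
    """Reward r = r_SP + r_GG + r_valve for one time step.

    x and x_ref map the names "p_CC", "MR_GG", and "MR_PI" to measured
    and reference values; valve_steps holds the changes in valve position
    since the previous time step.
    """
    # Set-point term: relative deviation of each controlled variable,
    # limited to clip_max per variable.
    r_sp = -sum(min(abs(x[k] - x_ref[k]) / x_ref[k], clip_max) for k in x_ref)
    # Gas-generator term: extra penalty only when the mixture ratio
    # exceeds its reference (assumed threshold).
    excess = (x["MR_GG"] - x_ref["MR_GG"]) / x_ref["MR_GG"]
    r_gg = -excess if excess > 0.0 else 0.0
    # Valve term: mean absolute change in valve position.
    r_valve = -sum(abs(s) for s in valve_steps) / len(valve_steps)
    return r_sp + r_gg + r_valve


x_ref = {"p_CC": 100.0, "MR_GG": 0.9, "MR_PI": 5.2}
x_now = {"p_CC": 50.0, "MR_GG": 0.99, "MR_PI": 5.2}
r = reward(x_now, x_ref, valve_steps=[0.3, 0.0, 0.0])
```

In this example the pressure deviation of 50 % contributes only the clipped value of 0.2, while the 10 % mixture ratio overshoot is counted in both the set-point and the gas-generator term.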

To train and use an RL agent, one needs to define the observation
and action space of the agent. The observation space,
i.e. the variables the agent receives from the environment at 
each time step, should at least contain sufficient information 
to unambiguously define the state of the system. In our set-up, 
the observation space 


$$X = [p_{CC,\mathrm{ref}}, e_{CC}, e_{PI}, e_{GG}, Pos_{VGO}, Pos_{VGH}, Pos_{VGC}, \omega_{LOX}, \omega_{LH2}] \qquad (15)$$


contains 9 variables, where $e_i = x_i - x_{i,\mathrm{ref}}$ is the absolute error
for each controlled variable, $Pos_{VGO}$, $Pos_{VGH}$, and $Pos_{VGC}$
are the positions of all control valves, and $\omega_{LOX}$ and $\omega_{LH2}$ are
the rotational speeds of the turbopumps. The observation space
is normalized with the reference steady-state values. All variables
in our observation space are measurable in real engines.
Thus our approach is not limited to simulation environments, 
where one could possibly use variables that are impossible to 
measure directly in real engines (e.g. the turbine efficiencies). 
The agent’s action space U consists of all 3 gas-generator
valve positions

$$U = [Pos_{VGO}, Pos_{VGH}, Pos_{VGC}]. \qquad (16)$$


At each time step, the RL agent receives observations from 
the environment and sends control signals to the flow control 
Fig. 3. Manipulated valve positions by the PID controllers for the
100 bar nominal start-up. VGO is used to control the mixture ratio of
the gas-generator, while VGH and VGC control the pressure of the main
combustion chamber and the global mixture ratio respectively. 
Fig. 4. Manipulated valve positions by the RL agent for the 100 bar
nominal start-up. The action clearly changes at t = 2.6 s, which is the 
time when the firing of turbine starter stops. 


valves of the engine. The frequency of interaction between 
the controller (RL-agent and PID) and the environment is set 
to 25 Hz. 


V. RESULTS 


In this section, we assess the performance of our RL 
controller. For this we use the approximation of the integrated 
absolute error over one entire episode for each controlled 
variable: 


$$(\Delta x)_i = \int |e_i| \, dt \approx \sum_{t_j} |e_i(t_j)|, \qquad (17)$$


where $t_j$ are the discrete time steps. Furthermore, we evaluate
the average steady-state values of the controlled variables from 
t = 3.5s to t = 5.0s and the value of the cumulative reward. 
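Both metrics are simple to compute from logged trajectories. The sketch below assumes the error samples and time stamps are available as plain lists; the function names are hypothetical, and the sum is taken without a time-step factor, following the approximation in (17).

```python
def integrated_absolute_error(errors):
    """Sum of |e_i(t_j)| over all discrete time steps of one episode."""
    return sum(abs(e) for e in errors)


def steady_state_mean(times, values, t_start=3.5, t_end=5.0):
    """Average of the samples recorded between t_start and t_end (in s)."""
    window = [v for t, v in zip(times, values) if t_start <= t <= t_end]
    return sum(window) / len(window)
```

Since all controllers are sampled at the same rate, omitting the constant time-step factor does not change their ranking.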

Before we turn to the performance of closed-loop control, 
let us record the downsides of open-loop sequences (OLS). 
The first column in Fig. 5 shows the resulting engine start-up


for the nominal OLS and degrading turbine efficiencies. For 
the latter, the steady-state values deviate strongly from the 
reference values. The minimum steady-state value of the main 
combustion chamber pressure is 92 bar. The steady-state of the 
global mixture ratio varies between 4.9 and 6.0. To prevent 
fuel or oxidizer from running out during a mission in the event 
of a persisting mixture ratio deviation, the loaded propellants
must be increased, which reduces the payload capacity of the
launch vehicle. A further negative effect is that the temperature
in the combustion chamber can rise significantly due to a shift
in the mixture ratio, which could reduce the engine’s service
life. Additionally, the steady-state value of the mixture ratio 
of the gas-generator changes too. The temperature in the gas- 
generator is sensitive to the mixture ratio, and an increased 
temperature can also damage the turbines. These damaging 
conditions are especially problematic for reusable engines, 
which must possess a long service life. The same implications 
apply to the 80 bar case as Fig. [B] shows. 

These unfavorable effects can be counteracted with a
closed-loop control system. First, we tune a family of PID
controllers to achieve the start-up. Controlling the chamber
pressure of the main combustion chamber, the mixture ratio
of the gas-generator, and the global mixture ratio by
manipulating VGO, VGH, and VGC is a coupled problem.
For example, changing VGO affects not only the mixture ratio
of the gas-generator but also the other two controlled
variables. Nevertheless, for rocket engine control near
steady-state conditions, the standard approach is to use separate
PID controllers and tune the control loops at different speeds
to avoid oscillations [7]. Hence, we also use three separate
controllers.

The first controller manipulates VGO to control the mixture
ratio of the gas-generator, the second controller manipulates
VGH to control the chamber pressure of the main combustion
chamber, and the third controller manipulates VGC to control
the global mixture ratio. Starting far away from the reference
point can be problematic for a simple PID controller because
the integrator accumulates a significant error during
the rise. Consequently, a large overshoot may occur. Modern
PID controllers use different methods to address this problem
of integrator windup. We use a simple feedback loop in which
the difference between the actual and the commanded actuator
position is fed back to the integrator to avoid the effects of
saturation. If there is no saturation, our anti-windup scheme
has no effect. The ratio between the anti-windup time constant
and the integration time is 0.1 for all PID controllers.
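The anti-windup scheme described above corresponds to what is commonly called back-calculation. The following sketch shows one possible discrete implementation. It assumes a standard-form PID law, explicit Euler integration, and an actuator range of [0, 1]; the class and parameter names are hypothetical, and only the ratio of 0.1 between the anti-windup time constant and the integration time is taken from the text.

```python
class AntiWindupPID:
    """Standard-form PID with back-calculation anti-windup (illustrative)."""

    def __init__(self, kp, ti, td, dt, u_min=0.0, u_max=1.0, tracking_ratio=0.1):
        self.kp, self.ti, self.td, self.dt = kp, ti, td, dt
        self.u_min, self.u_max = u_min, u_max
        self.tt = tracking_ratio * ti   # anti-windup (tracking) time constant
        self.integral = 0.0             # integral term in actuator units
        self.prev_error = None

    def update(self, reference, measurement):
        error = reference - measurement
        # No derivative action on the very first sample.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error

        commanded = self.kp * (error + self.td * derivative) + self.integral
        actual = min(self.u_max, max(self.u_min, commanded))  # saturation

        # The integrator sees the control error plus the fed-back difference
        # between actual and commanded position. Without saturation this
        # difference is zero and the scheme has no effect.
        # Note: explicit Euler needs dt / tt < 2 to stay stable.
        self.integral += self.dt * (
            self.kp / self.ti * error + (actual - commanded) / self.tt
        )
        return actual
```

With a persistent error that keeps the actuator saturated, the integral term settles at a finite value instead of growing without bound, which is what limits the overshoot once the controlled variable approaches its reference.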

Fig. 5. Comparison of the controlled variables for the 100 bar start-up. The shaded area marks the range of the controlled variable for different degraded
efficiencies. At different turbine efficiencies, the standard open-loop sequence provides significantly different steady-state values for the chamber
pressure and the mixture ratios.

TABLE I
CONTROLLER PERFORMANCE FOR NOMINAL TURBINE EFFICIENCIES

Target      Algo.  Reward    Steady-State Values               IAE
p_CC (bar)         cum. (-)  p_CC (bar)  MR_GG (-)  MR_PI (-)  p_CC (bar)  MR_GG (-)  MR_PI (-)
100         OLS    -7.9      100.0       0.90       5.18       591         4.7        27
100         PID    -6.5      98.9        0.90       5.17       632         4.0        19
100         RL     -4.2      99.9        0.90       5.18       519         2.8        13
80          OLS    -7.0      79.7        0.90       5.20       576         6.0        15
80          PID    -5.2      79.9        0.90       5.20       433         3.8        9
80          RL     -4.4      80.8        0.90       5.18       366         2.9        8

For PID parameter tuning, we directly use the simulation
model coupled with a genetic algorithm [41] of the Distributed
Evolutionary Algorithms in Python (DEAP) framework [42].
To guarantee a fair comparison, we use the reward function to
calculate the fitness value of a given parameter combination.
Table IV presents the optimal PID parameters, which maximize
the reward function. The genetic algorithm uses a population
of 5000 valid individuals and evolves the population for
20 generations. Fig. 3 shows that the best PID controllers open
the valves in a nonmonotonic way, which leads to a faster
start-up. Furthermore, the PID controllers fulfill their main
task: the feedback loops lead to an adjustment of the valve
positions at lower turbine efficiencies and significantly reduce
the deviations from the reference values of the controlled
variables. Due to the structure of PID controllers, with their
proportional, integral, and derivative terms, the shape of the
control input is restricted and does not provide optimal control.
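The idea of tuning controller gains with a genetic algorithm whose fitness is the closed-loop reward can be sketched with the standard library alone. The sketch deliberately does not use the DEAP API. The first-order toy plant, the PI structure, the search bounds, the operators (tournament selection, blend crossover, Gaussian mutation, one elite), and the small population are all assumptions made for illustration and differ from the set-up used in this work.

```python
import random

DT, N_STEPS, PLANT_TAU = 0.04, 100, 0.5   # toy simulation settings (assumed)
BOUNDS = [(0.1, 10.0), (0.05, 5.0)]       # assumed search ranges for (K_p, T_i)


def fitness(params):
    """Reward of a PI loop on a toy first-order plant: negative sum of |error|."""
    kp, ti = params
    y = integral = cost = 0.0
    for _ in range(N_STEPS):
        error = 1.0 - y
        integral += DT * error / ti
        u = min(2.0, max(0.0, kp * (error + integral)))  # saturated actuator
        y += DT * (u - y) / PLANT_TAU
        cost += abs(error)
    return -cost


def crossover(a, b, rng):
    """Blend two parents with a random weight."""
    w = rng.random()
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]


def mutate(individual, rng, sigma=0.1, prob=0.3):
    """Perturb genes with Gaussian noise and clip them to the bounds."""
    child = []
    for x, (lo, hi) in zip(individual, BOUNDS):
        if rng.random() < prob:
            x += rng.gauss(0.0, sigma * (hi - lo))
        child.append(min(hi, max(lo, x)))
    return child


def tournament(population, fits, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=fits.__getitem__)]


def evolve(pop_size=40, generations=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in BOUNDS]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fits = [fitness(ind) for ind in population]
        best = max(range(pop_size), key=fits.__getitem__)
        history.append(fits[best])
        children = [population[best]]          # elitism: keep the best as is
        while len(children) < pop_size:
            a = tournament(population, fits, rng)
            b = tournament(population, fits, rng)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    fits = [fitness(ind) for ind in population]
    best = max(range(pop_size), key=fits.__getitem__)
    history.append(fits[best])
    return population[best], fits[best], history
```

Because the best individual is carried over unchanged, the best fitness in `history` never decreases from one generation to the next.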

Fig. 5 shows that the optimized PID controllers lead to
some overshoot of the main combustion chamber pressure
and the global mixture ratio. It is possible to eliminate the
overshoot by changing the PID parameters, but this would
significantly increase the settling time. For our parameters,
there is still an error in combustion chamber pressure after 4 s,
even for nominal efficiencies. The settling time is not the only
reason for a large error in the combustion chamber pressure in
the case of lower turbine efficiencies. For the lowest turbine
efficiencies, a combustion chamber pressure of 100 bar is
physically no longer possible while maintaining the other
constraints (especially the desired gas-generator mixture ratio).
A specific disadvantage of PID controllers is that degrading
efficiencies or other system parameters cannot be considered
directly as further input variables. Fig. 6 shows that VGC
oscillates slightly for the 80 bar start-up. It is challenging to
tune a single family of PID controllers for different reference
combustion chamber pressures. For even lower combustion
chamber pressures (deep throttling), it becomes increasingly
difficult to achieve a convincing performance for all
operating conditions. The prevention of oscillations leads to
an increased settling time for all reference values. All in
all, the performance of the PID controllers is not perfect but
satisfactory for the cases of 100 bar and 80 bar and fixed
mixture ratios.

Now we examine the performance of our RL approach.
A comparison of Fig. 3 and Fig. 4 shows that, at first
glance, the RL agent’s behavior is strongly similar to that of
the PID controllers. The flow control valves are opened in a
nonmonotonic way. Nevertheless, the agent achieves an
even faster start-up, as presented in Fig. 5. The RL controller
can better control the combustion chamber pressure and the
global mixture ratio. The control of the gas-generator mixture
ratio is comparatively good. Furthermore, the RL agent can
directly take the firing of the turbine starter into account.
The action changes at t = 2.6 s, which is the time when the
firing of the turbine starter stops. Similar to the PID controllers,
the RL agent can handle degrading turbine efficiencies to a
certain extent. It can detect deviating efficiencies because the
relationship between valve positions and controlled variables
changes, and it adjusts the start-up accordingly. A prerequisite
for this is that the valve positions are also included in the
observation space and that experiences with different
efficiencies were generated during the training.

Table I compares the rewards, steady-state values, and IAEs
of the studied approaches for nominal turbine efficiencies
and both main combustion chamber pressures of 100 bar and
80 bar. The open-loop sequences are satisfactory for the nominal
start-ups. Nevertheless, both IAEs and rewards show that
improvement is possible. One can start up faster if the valves
are opened nonmonotonically. Why is this not done for realistic
start-up sequences? As already mentioned, it is common
practice to determine the control sequences by means of tests
on test benches, which is expensive and time-consuming. With
non-reusable engines, the demands on the control system are
less stringent, and one can accept good but not optimal
sequences as long as a large share of the development costs is
saved. Another reason is that, as a rule, disturbances influence
the start-up anyway and cancel out the advantages of optimized
sequences. The advantages can only be realized by closing
the control loop. The tuned PID controllers are better than the
open-loop sequences with respect to the value of the reward. The
RL agent is even better. The RL agent and the PID controllers
also achieve decent steady-state values.

Table II compares the rewards, steady-state values, and IAEs
of the studied approaches for degrading turbine efficiencies.
We present the mean and standard deviation of the measures
instead of giving all values for the 16 different combinations of
turbine efficiencies. For the steady-state values, the minimum
and maximum values are also listed in Table II. As already
seen above, the OLS results in large deviations for
degrading turbine efficiencies. For 100 bar, the steady-state
main combustion chamber pressure ranges from 92 bar to
100 bar. Furthermore, degrading turbine efficiencies strongly
influence the overall mixture ratio $MR_{PI}$. Large deviations
in $MR_{PI}$ (here from 4.9 to 6.0) pose two major problems.
First, the fuel and oxidizer tank volumes are designed for
the nominal mixture ratio. Deviations in $MR_{PI}$ result in a
non-ideal utilization of the propellants, thus lowering the
launcher’s performance. Second, the mixture ratio in the main
combustion chamber is directly affected by the overall mixture
ratio, potentially resulting in more damaging conditions for
the main combustion chamber. The cumulative reward for the
OLS deteriorates to a mean value of -14.9 with a large standard
deviation of 5.2.
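The aggregation over the efficiency combinations can be sketched as follows. Four efficiency levels per turbine and 16 combinations imply two turbines. The equal spacing of the levels between 100% and 85%, the use of the sample standard deviation, and the function names are assumptions; `evaluate` stands for any routine that returns one scalar measure (for example, a steady-state value) for a pair of efficiency factors.

```python
from itertools import product
from statistics import mean, stdev

# Assumed: four equally spaced levels, as fractions of the nominal efficiency.
EFFICIENCY_LEVELS = [1.00, 0.95, 0.90, 0.85]

# One factor per turbine gives 4 x 4 = 16 combinations.
COMBINATIONS = list(product(EFFICIENCY_LEVELS, repeat=2))


def summarize(values):
    """Return the statistics reported per measure: min, max, mean, and sd."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "sd": stdev(values),   # sample standard deviation (assumption)
    }


def summarize_over_grid(evaluate):
    """Evaluate a scalar measure for every combination and summarize it."""
    return summarize([evaluate(eta_a, eta_b) for eta_a, eta_b in COMBINATIONS])
```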

The performance of both closed-loop controllers
highlights the benefits of closed-loop control for degrading
turbine efficiencies. The means and standard deviations of the
cumulative rewards are much smaller in magnitude for the PID
controllers and the RL agent. The additional reduction for the
RL agent is mainly due to an even faster start-up. The mean
steady-state value of $p_{CC}$ is 98.8 bar for the agent, which
is a little closer to 100.0 bar than the value of the PID
controllers and much closer than the value of the open-loop
sequence. Furthermore, the maximum deviation is the smallest.
The advantages of closed-loop control, and especially of the RL
approach, are also reflected in the mixture ratios, which are
much closer to their nominal values compared to the OLS.
The IAEs also show that the RL agent performs better than
the PID controllers.

TABLE II
CONTROLLER PERFORMANCE FOR 16 DIFFERENT COMBINATIONS OF DEGRADED TURBINE EFFICIENCIES

Reward and steady-state values

Target      Algo.  Reward cum. (-)  p_CC (bar)                  MR_GG (-)                 MR_PI (-)
p_CC (bar)         mean    sd       min    max    mean   sd     min   max   mean  sd      min  max  mean  sd
100         OLS    -14.9   5.2      92.0   100.0  96.1   2.3    0.87  0.96  0.91  0.03    4.9  6.0  5.4   0.3
100         PID    -7.6    0.7      95.5   98.9   97.7   1.0    0.89  0.90  0.89  0.01    5.0  5.2  5.2   0.1
100         RL     -5.5    0.9      96.0   100.7  98.8   1.4    0.89  0.90  0.90  0.02    5.1  5.3  5.2   0.0
80          OLS    -15.4   5.8      71.1   79.7   75.5   2.4    0.86  0.96  0.91  0.03    4.8  6.2  5.5   0.4
80          PID    -5.7    0.6      79.5   79.9   79.7   0.1    0.90  0.90  0.90  0.00    5.2  5.5  5.2   0.1
80          RL     -4.4    0.8      79.6   82.1   80.3   0.6    0.89  0.90  0.90  0.00    5.2  5.4  5.2   0.0

Integral absolute error (IAE)

Target      Algo.  p_CC (bar)     MR_GG (-)     MR_PI (-)
p_CC (bar)         mean   sd      mean  sd      mean  sd
100         OLS    1841   151     5.8   1.0     45    18
100         PID    758    98      4.1   0.1     25    4
100         RL     636    91      2.9   0.1     29    6
80          OLS    815    149     7.0   0.9     36    21
80          PID    459    18      3.9   0.1     12
80          RL     366    30      3.3   0.1     10    5

For each turbine, 4 different efficiencies are considered ranging from 100% to 85% of the nominal value.


VI. CONCLUSION AND OUTLOOK 


In this work, we presented an RL approach for the optimal
control of the fully continuous phase of the start-up of a
gas-generator cycle liquid rocket engine. Using a suitable
engine simulator, we employed the TD3 algorithm to learn
an optimal policy. The policy achieves the best performance
compared with carefully tuned open-loop sequences and PID
controllers for different reference states and varying turbine
efficiencies. Furthermore, the prediction of the control action
takes only 0.7 ms, which allows a high interaction frequency
and, in comparison to MPC, enables the real-time use of RL
algorithms for closed-loop control. The modest computational
requirements should be met by the current generation of engine
control units. A potential drawback of the RL approach is the
lack of stability guarantees. Nevertheless, the control system
can be tested using a high-fidelity simulation model, and there
is ongoing work on certifying the stability of RL policies [33].

The present work can be improved in many directions.
It is necessary to carefully examine the performance of the
controller when various disturbances occur. Disturbance
rejection, integration of filtering, and observer design will be
the focus of future work. Furthermore, even the most sophisticated
models usually have prediction errors due to neglected
effects or model misspecifications. Therefore, it is essential
to ensure that controllers trained in a simulation environment
are robust enough to be used in real applications. There are RL
approaches that explicitly consider modeling errors. Domain
randomization [43] can produce agents that generalize well
to a wide range of environments. Another issue with RL
is implementing hard state constraints. In the case
of liquid rocket engine control, one would like to impose
hard constraints to limit the maximum rotational speed of the
turbopumps and maximum temperatures to prevent damage to
the engine. It is possible to approximate hard state constraints
by carefully tuning the reward function, e.g., one can give the
agent a sizeable negative reward upon constraint violation and
possibly terminate the training episode. Besides, there has been
recent work on implementing hard constraints in RL using
constrained policy optimization [44].
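The reward-based approximation of hard state constraints mentioned above can be sketched as a thin wrapper around an existing reward. The function name, the state keys, the limit values, and the size of the penalty are hypothetical and serve only to illustrate the mechanism of a large negative reward combined with episode termination.

```python
from typing import Dict, List, Tuple


def apply_constraint_penalty(
    base_reward: float,
    state: Dict[str, float],
    limits: Dict[str, float],
    penalty: float = 100.0,        # assumed magnitude of the negative reward
) -> Tuple[float, bool, List[str]]:
    """Return (reward, terminate_episode, names of violated constraints)."""
    violated = [name for name, limit in limits.items() if state[name] > limit]
    if violated:
        # Sizeable negative reward and early termination upon violation.
        return base_reward - penalty, True, violated
    return base_reward, False, violated


# Hypothetical limits on turbopump speed and gas-generator temperature.
limits = {"turbopump_speed_rpm": 40000.0, "gas_generator_temperature_K": 1100.0}

safe_state = {"turbopump_speed_rpm": 35000.0, "gas_generator_temperature_K": 1000.0}
hot_state = {"turbopump_speed_rpm": 35000.0, "gas_generator_temperature_K": 1150.0}

safe_result = apply_constraint_penalty(-0.5, safe_state, limits)
hot_result = apply_constraint_penalty(-0.5, hot_state, limits)
```

Such a penalty only discourages violations during training. It does not guarantee that the trained policy respects the limits, which is why it approximates rather than enforces a hard constraint.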

We would like to conclude this publication with an outlook
on the potential advantages of this approach for rocket engine
control. Controllers trained with RL can depend on many input
variables, can be used for very different operating conditions,
and can include multiple objectives. The thrust control of
rocket engines is crucial for improving the performance of
the launch vehicle, but it is particularly critical when using
rocket engines for the soft landing of returning rocket stages.
Deep throttling domains of an engine, i.e., the range of 25-100%
of nominal thrust, are not expected to pose a problem for RL
controllers. Regarding multiple objectives, one can modify the
reward function to optimize both the system’s performance
and damage mitigation [45]. The coupling of sophisticated
health monitoring systems, possibly based on machine learning
techniques, with suitable policies trained by RL can further
increase the reliability of launch systems. Given a suitable
simulation environment, end-to-end RL may even enable the
training of integrated flight and engine control systems.
Overall, it is hoped that the current work will serve as a basis for
future studies regarding the application of RL in the field of
rocket engine control.
APPENDIX I
IMPLEMENTATION AND TRAINING DETAILS


The agent is trained for 100 000 time steps, which corresponds
to approximately 1.5 hours of simulation time. The agent’s
hyperparameters are tuned with Optuna [46] and are presented
in Table III. For exploration, we use action noise sampled
from an Ornstein-Uhlenbeck process [47]. Table IV shows the
parameters of all three PID controllers and the corresponding
controlled variable and control valve.


TABLE III
TD3 HYPERPARAMETERS


Parameter                          Value
number of hidden units per layer   [400, 300]
number of hidden layers            2
activation function                ReLU
optimizer                          Adam
number of samples per minibatch    256
learning rate                      0.001
soft update coefficient (τ)        0.005
train frequency                    10
gradient steps                     10
discount rate (γ)                  0.90
warm-up steps                      5000
total training steps               100 000
size of the replay buffer          25 000
target policy noise                0.01
target noise clip                  0.02
policy delay                       2
action noise type                  Ornstein-Uhlenbeck
action noise std (σ)               0.05
rate of mean reversion (θ)         0.25
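For illustration, temporally correlated exploration noise from an Ornstein-Uhlenbeck process can be generated as sketched below. The noise standard deviation of 0.05 and the mean-reversion rate of 0.25 are taken from Table III. The Euler-Maruyama discretization, the time step `dt`, the zero mean, and the class name are assumptions and need not match the implementation used for training.

```python
import math
import random
from typing import List, Optional


class OrnsteinUhlenbeckNoise:
    """Discretized Ornstein-Uhlenbeck process for action exploration (sketch)."""

    def __init__(
        self,
        n_actions: int = 3,        # one noise channel per valve
        sigma: float = 0.05,       # action noise std (Table III)
        theta: float = 0.25,       # rate of mean reversion (Table III)
        mu: float = 0.0,           # assumed long-term mean
        dt: float = 0.04,          # assumed step, matching the 25 Hz control rate
        seed: Optional[int] = None,
    ) -> None:
        self.n_actions, self.sigma, self.theta = n_actions, sigma, theta
        self.mu, self.dt = mu, dt
        self.rng = random.Random(seed)
        self.state = [mu] * n_actions

    def reset(self) -> None:
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.n_actions

    def sample(self) -> List[float]:
        """Advance the process by one step and return the noise vector."""
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

During training, the sampled vector would be added to the policy output before the valve commands are clipped to their admissible range.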


TABLE IV
PID PARAMETERS


Valve  Controlled Variable  Parameter  Value
VGO    MR_GG (-)            K_p        98.5
                            T_i        36.3
                            T_d        3.56 × 10^-4
VGH    p_CC (Pa)            K_p        2.59 × 10^-7
                            T_i        1.22
                            T_d        6.82 × 10^-3
VGC    MR_PI (-)            K_p        0.786
                            T_i        1.06
                            T_d        2.12 × 10^-?


APPENDIX II 
PLOTS FOR 80 BAR CASE 


Fig. 9 shows the nominal OLS for a main combustion
chamber pressure of 80 bar. The valve positions manipulated
by the PID controllers and the RL agent for 80 bar are shown
in Fig. 6 and Fig. 7. Finally, Fig. 8 compares controller
performances for different degraded turbine efficiencies.


[Figure: valve positions of VGO, VGH, and VGC (0.0 to 1.0) over Time (s) from 1.5 to 5.0]

Fig. 6. Manipulated valve positions by the PID controllers for the
80 bar nominal start-up. VGO is used to control the mixture ratio of the
gas-generator, while VGH and VGC control the pressure of the main
combustion chamber and the global mixture ratio, respectively.

Fig. 7. Manipulated valve positions by the RL agent for the 80 bar
nominal start-up. The action clearly changes at t = 2.6 s, which is the
time when the firing of the turbine starter stops.

Fig. 8. Comparison of the controlled variables for the 80 bar start-up. The shaded area marks the range of the controlled variable for different degraded
efficiencies. At different turbine efficiencies, the standard open-loop sequence provides significantly different steady-state values for the chamber
pressure and the mixture ratios.

Fig. 9. 80 bar nominal start-up sequence. The main combustion chamber pressure settles at 80 bar, while the gas-generator pressure reaches
45 bar. The reference mixture ratios are given by 5.2 and 0.9. The engine reaches steady-state operating conditions after approximately 4 s. The
cumulative reward for the OLS is dominated by the set-point error fsp.